The so-called pinball loss for estimating conditional quantiles is a well-known tool in both
statistics and machine learning. So far, however, only little work has been done to quantify the
efficiency of this tool for nonparametric approaches. We fill this gap by establishing inequalities
that describe how close approximate pinball risk minimizers are to the corresponding conditional
quantile. These inequalities, which hold under mild assumptions on the data-generating
distribution, are then used to establish so-called variance bounds, which recently turned out to
play an important role in the statistical analysis of (regularized) empirical risk minimization
approaches. Finally, we use both types of inequalities to establish an oracle inequality for support
vector machines that use the pinball loss. The resulting learning rates are min-max optimal
under some standard regularity assumptions on the conditional quantile.
1. Introduction

Let $P$ be a distribution on $X \times \mathbb{R}$, where $X$ is an arbitrary set equipped with a
$\sigma$-algebra. The goal of quantile regression is to estimate the conditional quantile, that is,
the set-valued function

\[ F^*_{\tau,P}(x) := \{ t \in \mathbb{R} : P((-\infty,t] \mid x) \ge \tau \text{ and } P([t,\infty) \mid x) \ge 1-\tau \}, \qquad x \in X, \]

where $\tau \in (0,1)$ is a fixed constant specifying the desired quantile level and $P(\cdot \mid x)$,
$x \in X$, is the regular conditional probability of $P$. Throughout this paper, we assume that
$P(\cdot \mid x)$ has its support in $[-1,1]$ for $P_X$-almost all $x \in X$, where $P_X$ denotes the
marginal distribution of $P$ on $X$. (By a simple scaling argument, all our results can be generalized
to distributions living on $X \times [-M,M]$ for some $M > 0$. The uniform boundedness of the
conditionals $P(\cdot \mid x)$ is, however, crucial.) Let us additionally assume for a moment that
$F^*_{\tau,P}(x)$ consists of singletons, that is, there exists an $f^*_{\tau,P} : X \to \mathbb{R}$,
called the conditional $\tau$-quantile function, such that $F^*_{\tau,P}(x) = \{ f^*_{\tau,P}(x) \}$ for
$P_X$-almost all $x \in X$. (Most of our main results do not require this assumption, but here, in the
introduction, it makes the exposition more transparent.) Then one approach to estimate the conditional
$\tau$-quantile function is based on the so-called $\tau$-pinball loss
$L : Y \times \mathbb{R} \to [0,\infty)$, which is defined by

\[ L(y,t) := \begin{cases} (1-\tau)(t-y), & \text{if } y < t, \\ \tau(y-t), & \text{if } y \ge t. \end{cases} \]

With the help of this loss function we define the $L$-risk of a function $f : X \to \mathbb{R}$ by

\[ \mathcal{R}_{L,P}(f) := \mathbb{E}_{(x,y)\sim P}\, L(y,f(x)) = \int_{X \times Y} L(y,f(x)) \, dP(x,y). \]

Recall that $f^*_{\tau,P}$ is up to $P_X$-zero sets the only function satisfying
$\mathcal{R}_{L,P}(f^*_{\tau,P}) = \inf \mathcal{R}_{L,P}(f) =: \mathcal{R}^*_{L,P}$, where the infimum
is taken over all measurable functions $f : X \to \mathbb{R}$. Based on this observation, several
estimators minimizing a (modified) empirical $L$-risk were proposed (see [13] for a survey on both
parametric and nonparametric methods) for situations where $P$ is unknown, but i.i.d. samples
$D := ((x_1,y_1),\ldots,(x_n,y_n)) \in (X \times \mathbb{R})^n$ drawn from $P$ are given.
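
The definitions of the pinball loss and the empirical risk can be made concrete with a short numerical sketch. The snippet below is only an illustration under stated assumptions: the sample is synthetic (uniform on $[-1,1]$), the function names `pinball_loss` and `empirical_risk` are hypothetical, and the minimization over constants is done on a grid rather than exactly. It shows that the constant minimizing the empirical pinball risk is close to the empirical $\tau$-quantile.

```python
import numpy as np

def pinball_loss(y, t, tau):
    """tau-pinball loss L(y, t): (1 - tau) * (t - y) if y < t, else tau * (y - t)."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return np.where(y < t, (1.0 - tau) * (t - y), tau * (y - t))

def empirical_risk(y, predictions, tau):
    """Empirical L-risk: the average pinball loss over the sample."""
    return float(np.mean(pinball_loss(y, predictions, tau)))

# Synthetic sample (an assumption made for illustration): y uniform on [-1, 1].
rng = np.random.default_rng(0)
y = rng.uniform(-1.0, 1.0, size=2000)
tau = 0.25

# Minimize the empirical risk over constant predictions on a grid.
grid = np.linspace(-1.0, 1.0, 401)
risks = [empirical_risk(y, t, tau) for t in grid]
best_constant = float(grid[int(np.argmin(risks))])

print(best_constant, float(np.quantile(y, tau)))
```

For the uniform distribution on $[-1,1]$ the true $\tau$-quantile is $2\tau - 1$, so both printed numbers should be near $-0.5$.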

Empirical methods estimating quantile functions with the help of the pinball loss typically
obtain functions $f_D$ for which $\mathcal{R}_{L,P}(f_D)$ is close to $\mathcal{R}^*_{L,P}$ with high
probability. In general, however, this only implies that $f_D$ is close to $f^*_{\tau,P}$ in a very
weak sense (see [21], Remark 3.18), but recently, [23], Theorem 2.5, established self-calibration
inequalities of the form

\[ \| f - f^*_{\tau,P} \|_{L_r(P_X)} \le c_P \sqrt{\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}}, \tag{1} \]

which hold under mild assumptions on $P$ described by the parameter $r \in (0,1]$. The first
goal of this paper is to generalize and to improve these inequalities. Moreover, we will use
these new self-calibration inequalities to establish variance bounds for the pinball risk,
which in turn are known to improve the statistical analysis of empirical risk minimization
(ERM) approaches.

The second goal of this paper is to apply the self-calibration inequalities and the
variance bounds to support vector machines (SVMs) for quantile regression. Recall that
[12, 20, 26] proposed an SVM that finds a solution $f_{D,\lambda} \in H$ of

\[ \arg\min_{f \in H} \; \lambda \|f\|_H^2 + \mathcal{R}_{L,D}(f), \tag{2} \]

where $\lambda > 0$ is a regularization parameter, $H$ is a reproducing kernel Hilbert space
(RKHS) over $X$, and $\mathcal{R}_{L,D}(f)$ denotes the empirical risk of $f$, that is,
$\mathcal{R}_{L,D}(f) := \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))$. In [9] robustness properties and
consistency for all distributions $P$ on $X \times \mathbb{R}$ were established for this SVM, while
[12, 26] worked out how to solve this optimization problem with standard techniques from machine
learning. Moreover, [26] also provided an exhaustive empirical study, which shows the excellent
performance of this SVM. We have recently established an oracle inequality for these SVMs in [23],
which was based on (1) and the resulting variance bounds. In this paper, we improve
this oracle inequality with the help of the new self-calibration inequalities and variance
bounds. It turns out that the resulting learning rates are substantially faster than those
of [23]. Finally, we briefly discuss an adaptive parameter selection strategy.
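
To make the optimization problem (2) tangible, here is a minimal sketch that minimizes the regularized empirical pinball risk over functions of the form $f = \sum_j \alpha_j k(x_j, \cdot)$. It is not the solver used in [12, 26]; it swaps in plain functional subgradient descent, and the Gaussian kernel, its width, the step sizes, the synthetic data and all function names are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(a, b, width=0.5):
    """Gaussian kernel matrix; k(x, x) = 1, so the kernel is bounded by 1."""
    diff = a[:, None] - b[None, :]
    return np.exp(-(diff ** 2) / (2.0 * width ** 2))

def objective(alpha, K, y, tau, lam):
    """lam * ||f||_H^2 + empirical pinball risk for f = sum_j alpha_j k(x_j, .)."""
    f = K @ alpha
    loss = np.where(y < f, (1.0 - tau) * (f - y), tau * (y - f))
    return float(lam * (alpha @ K @ alpha) + loss.mean())

def fit_quantile_svm(x, y, tau, lam, steps=500, step0=1.0):
    """Functional subgradient descent on (2); returns the best coefficients seen."""
    n = len(y)
    K = gaussian_kernel(x, x)
    alpha = np.zeros(n)
    best_alpha, best_value = alpha.copy(), objective(alpha, K, y, tau, lam)
    for t in range(1, steps + 1):
        f = K @ alpha
        # Subgradient of the pinball loss with respect to the prediction f(x_i).
        g = np.where(y < f, 1.0 - tau, -tau)
        # Subgradient in H, written in the expansion coefficients: 2*lam*alpha + g/n.
        alpha = alpha - (step0 / np.sqrt(t)) * (2.0 * lam * alpha + g / n)
        value = objective(alpha, K, y, tau, lam)
        if value < best_value:
            best_alpha, best_value = alpha.copy(), value
    return best_alpha, best_value, K

# Synthetic data (an assumption): noisy sine curve, clipped so that y lies in [-1, 1].
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.clip(0.5 * np.sin(3.0 * x) + 0.2 * rng.standard_normal(200), -1.0, 1.0)
alpha, value, K = fit_quantile_svm(x, y, tau=0.5, lam=1e-3)
```

Subgradient descent is not monotone, which is why the sketch keeps track of the best iterate instead of returning the last one.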

The rest of this paper is organized as follows. In Section 2, we present both our new
self-calibration inequality and the new variance bound. We also introduce the assumptions
on P that lead to these inequalities and discuss how these inequalities improve our former 
results in [23]. In Section 3, we use these new inequalities to establish an oracle inequality 
for the SVM approach above. In addition, we discuss the resulting learning rates and how 
these can be achieved in an adaptive way. Finally, all proofs are contained in Section 4. 



2. Main results 

In order to formulate the main results of this section, we need to introduce some assumptions
on the data-generating distribution $P$. To this end, let $Q$ be a distribution on $\mathbb{R}$
and $\operatorname{supp} Q$ be its support. For $\tau \in (0,1)$, the $\tau$-quantile of $Q$ is the set

\[ F^*_\tau(Q) := \{ t \in \mathbb{R} : Q((-\infty,t]) \ge \tau \text{ and } Q([t,\infty)) \ge 1-\tau \}. \]

It is well known that $F^*_\tau(Q)$ is a bounded and closed interval. We write

\[ t^*_{\min}(Q) := \min F^*_\tau(Q) \qquad \text{and} \qquad t^*_{\max}(Q) := \max F^*_\tau(Q), \]

which implies $F^*_\tau(Q) = [t^*_{\min}(Q), t^*_{\max}(Q)]$. Moreover, it is easy to check that the
interior of $F^*_\tau(Q)$ is a $Q$-zero set, that is, $Q((t^*_{\min}(Q), t^*_{\max}(Q))) = 0$. To avoid
notational overload, we usually omit the argument $Q$ if the considered distribution is clearly
determined from the context.



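Definition 2.1 (Quantiles of type q). A distribution $Q$ with $\operatorname{supp} Q \subset [-1,1]$
is said to have a $\tau$-quantile of type $q \in (1,\infty)$ if there exist constants
$\alpha_Q \in (0,2]$ and $b_Q > 0$ such that

\[ Q((t^*_{\min} - s, t^*_{\min})) \ge b_Q s^{q-1}, \tag{3} \]
\[ Q((t^*_{\max}, t^*_{\max} + s)) \ge b_Q s^{q-1} \tag{4} \]

for all $s \in [0,\alpha_Q]$. Moreover, $Q$ has a $\tau$-quantile of type $q = 1$, if
$Q(\{t^*_{\min}\}) > 0$ and $Q(\{t^*_{\max}\}) > 0$. In this case, we define $\alpha_Q := 2$ and

\[ b_Q := \begin{cases} \min\{ Q(\{t^*_{\min}\}), Q(\{t^*_{\max}\}) \}, & \text{if } t^*_{\min} \ne t^*_{\max}, \\ \min\{ \tau - Q((-\infty,t^*_{\min})), \; Q((-\infty,t^*_{\max}]) - \tau \}, & \text{if } t^*_{\min} = t^*_{\max}, \end{cases} \]

where we note that $b_Q > 0$ in both cases. For all $q \ge 1$, we finally write
$\gamma_Q := b_Q \alpha_Q^{q-1}$.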
Since $\tau$-quantiles of type $q$ are the central concept of this work, let us illustrate this
notion by a few examples. We begin with an example for which all quantiles are of type 2.

Example 2.2. Let $\nu$ be a distribution with $\operatorname{supp} \nu \subset [-1,1]$, $\mu$ be a
distribution with $\operatorname{supp} \mu \subset [-1,1]$ that has a density $h$ with respect to the
Lebesgue measure and $Q := \alpha\nu + (1-\alpha)\mu$ for some $\alpha \in [0,1)$. If $h$ is bounded
away from 0, that is, $h(y) \ge b$ for some $b > 0$ and Lebesgue-almost all $y \in [-1,1]$, then $Q$
has a $\tau$-quantile of type $q = 2$ for all $\tau \in (0,1)$ as simple integration shows. In this
case, we set $b_Q := (1-\alpha)b$ and $\alpha_Q := \min\{ 1 + t^*_{\min}, 1 - t^*_{\max} \}$.

Example 2.3. Again, let $\nu$ be a distribution with $\operatorname{supp} \nu \subset [-1,1]$, $\mu$
be a distribution with $\operatorname{supp} \mu \subset [-1,1]$ that has a Lebesgue density $h$, and
$Q := \alpha\nu + (1-\alpha)\mu$ for some $\alpha \in [0,1)$. If, for a fixed $\tau \in (0,1)$, there
exist constants $b > 0$ and $p > -1$ such that

\[ h(y) \ge b\,(t^*_{\min}(Q) - y)^p, \qquad y \in [-1, t^*_{\min}(Q)], \]
\[ h(y) \ge b\,(y - t^*_{\max}(Q))^p, \qquad y \in [t^*_{\max}(Q), 1], \]

Lebesgue-almost surely, then simple integration shows that $Q$ has a $\tau$-quantile of type
$q = 2 + p$ and we may set $b_Q := (1-\alpha)b/(1+p)$ and
$\alpha_Q := \min\{ 1 + t^*_{\min}(Q), 1 - t^*_{\max}(Q) \}$.

Example 2.4. Let $\nu$ be a distribution with $\operatorname{supp} \nu \subset [-1,1]$ and
$Q := \alpha\nu + (1-\alpha)\delta_{t^*}$ for some $\alpha \in [0,1)$, where $\delta_{t^*}$ denotes the
Dirac measure at $t^* \in (0,1)$. If $\nu(\{t^*\}) = 0$, we then have
$Q((-\infty,t^*)) = \alpha\nu((-\infty,t^*))$ and $Q((-\infty,t^*]) = \alpha\nu((-\infty,t^*)) + 1 - \alpha$,
and hence $\{t^*\}$ is a $\tau$-quantile of type $q = 1$ for all $\tau$ satisfying
$\alpha\nu((-\infty,t^*)) < \tau < \alpha\nu((-\infty,t^*)) + 1 - \alpha$.

Example 2.5. Let $\nu$ be a distribution with $\operatorname{supp} \nu \subset [-1,1]$ and
$Q := (1-\alpha-\beta)\nu + \alpha\delta_{t_{\min}} + \beta\delta_{t_{\max}}$ for some
$\alpha, \beta \in (0,1]$ with $\alpha + \beta \le 1$. If $\nu([t_{\min}, t_{\max}]) = 0$, we have
$Q((-\infty,t_{\min}]) = (1-\alpha-\beta)\nu((-\infty,t_{\min}]) + \alpha$ and
$Q([t_{\max},\infty)) = (1-\alpha-\beta)(1 - \nu((-\infty,t_{\min}])) + \beta$. Consequently,
$[t_{\min}, t_{\max}]$ is the $\tau := (1-\alpha-\beta)\nu((-\infty,t_{\min}]) + \alpha$ quantile of $Q$
and this quantile is of type $q = 1$.

As outlined in the introduction, we are not interested in a single distribution $Q$ on $\mathbb{R}$
but in distributions $P$ on $X \times \mathbb{R}$. The following definition extends the previous
definition to such $P$.

Definition 2.6 (Quantiles of p-average type q). Let $p \in (0,\infty]$, $q \in [1,\infty)$, and $P$
be a distribution on $X \times \mathbb{R}$ with $\operatorname{supp} P(\cdot \mid x) \subset [-1,1]$ for
$P_X$-almost all $x \in X$. Then $P$ is said to have a $\tau$-quantile of $p$-average type $q$, if
$P(\cdot \mid x)$ has a $\tau$-quantile of type $q$ for $P_X$-almost all $x \in X$, and the function
$\gamma : X \to [0,\infty]$ defined, for $P_X$-almost all $x \in X$, by

\[ \gamma(x) := \gamma_{P(\cdot \mid x)}, \]

where $\gamma_{P(\cdot \mid x)} = b_{P(\cdot \mid x)} \alpha_{P(\cdot \mid x)}^{q-1}$ is defined in
Definition 2.1, satisfies $\gamma^{-1} \in L_p(P_X)$.
To establish the announced self-calibration inequality, we finally need the distance

\[ \operatorname{dist}(t, A) := \inf_{s \in A} |t - s| \]

between an element $t \in \mathbb{R}$ and an $A \subset \mathbb{R}$. Moreover,
$\operatorname{dist}(f, F^*_{\tau,P})$ denotes the function
$x \mapsto \operatorname{dist}(f(x), F^*_{\tau,P}(x))$. With these preparations the self-calibration
inequality reads as follows.



Theorem 2.7. Let $L$ be the $\tau$-pinball loss, $p \in (0,\infty]$ and $q \in [1,\infty)$ be real
numbers, and $r := \frac{pq}{p+1}$. Moreover, let $P$ be a distribution that has a $\tau$-quantile of
$p$-average type $q \in [1,\infty)$. Then, for all $f : X \to [-1,1]$, we have

\[ \| \operatorname{dist}(f, F^*_{\tau,P}) \|_{L_r(P_X)} \le 2^{1-1/q} q^{1/q} \| \gamma^{-1} \|_{L_p(P_X)}^{1/q} \bigl( \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \bigr)^{1/q}. \]
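
The inequality of Theorem 2.7 can be checked numerically in a simple special case. The sketch below assumes that every conditional distribution $P(\cdot \mid x)$ is the uniform distribution on $[-1,1]$, so that Example 2.2 applies with $b_Q = 1/2$, $q = 2$ and $p = \infty$ (hence $r = 2$). The closed form of the inner risk, the grid standing in for $P_X$ and the candidate function `f` are all choices made here for illustration.

```python
import numpy as np

def inner_risk_uniform(t, tau):
    """Pinball risk of the constant t under Q uniform on [-1, 1]:
    ((1 - tau) * (1 + t)^2 + tau * (1 - t)^2) / 4."""
    return ((1.0 - tau) * (1.0 + t) ** 2 + tau * (1.0 - t) ** 2) / 4.0

tau = 0.3
t_star = 2.0 * tau - 1.0                    # unique tau-quantile of the uniform distribution
alpha_Q = min(1.0 + t_star, 1.0 - t_star)   # width alpha_Q as in Example 2.2
gamma_Q = 0.5 * alpha_Q                     # gamma_Q = b_Q * alpha_Q^(q-1) with b_Q = 1/2, q = 2

x = np.linspace(0.0, 1.0, 1001)             # equispaced grid standing in for P_X on [0, 1]
f = np.clip(t_star + 0.4 * np.sin(6.0 * x), -1.0, 1.0)   # hypothetical candidate function

excess_risk = float(np.mean(inner_risk_uniform(f, tau) - inner_risk_uniform(t_star, tau)))
lhs = float(np.sqrt(np.mean((f - t_star) ** 2)))          # L_2(P_X)-distance to the quantile
rhs = 2.0 * gamma_Q ** -0.5 * excess_risk ** 0.5          # bound of Theorem 2.7 for q = 2, p = inf

print(lhs, rhs)
```

In this special case the excess risk equals the mean of $(f(x) - t^*)^2 / 4$, so the bound holds with the slack factor $\sqrt{2/\alpha_Q}$.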

Let us briefly compare the self-calibration inequality above with the one established
in [23]. To this end, we can solely focus on the case $q = 2$, since this was the only case
considered in [23]. For the same reason, we can restrict our considerations to distributions
$P$ that have a unique conditional $\tau$-quantile $f^*_{\tau,P}(x)$ for $P_X$-almost all $x \in X$.
Then Theorem 2.7 yields

\[ \| f - f^*_{\tau,P} \|_{L_r(P_X)} \le 2 \, \| \gamma^{-1} \|_{L_p(P_X)}^{1/2} \bigl( \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \bigr)^{1/2} \]

for $r := \frac{2p}{p+1}$. On the other hand, it was shown in [23], Theorem 2.5, that

\[ \| f - f^*_{\tau,P} \|_{L_{r/2}(P_X)} \le \sqrt{2} \, \| \gamma^{-1} \|_{L_p(P_X)}^{1/2} \bigl( \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \bigr)^{1/2} \]

under the additional assumption that the conditional widths $\alpha_{P(\cdot \mid x)}$ considered in
Definition 2.1 are independent of $x$. Consequently, our new self-calibration inequality is more
general and, modulo the constant $\sqrt{2}$, also sharper.

It is well known that self-calibration inequalities for Lipschitz continuous losses lead 
to variance bounds, which in turn are important for the statistical analysis of ERM 
approaches; see [1, 2, 14-17, 28]. For the pinball loss, we obtain the following variance 
bound. 



Theorem 2.8. Let $L$ be the $\tau$-pinball loss, $p \in (0,\infty]$ and $q \in [1,\infty)$ be real
numbers, and

\[ \vartheta := \min\Bigl\{ \frac{2}{q}, \frac{p}{p+1} \Bigr\}. \]

Let $P$ be a distribution that has a $\tau$-quantile of $p$-average type $q$. Then, for all
$f : X \to [-1,1]$, there exists an $f^*_{\tau,P} : X \to [-1,1]$ with
$f^*_{\tau,P}(x) \in F^*_{\tau,P}(x)$ for $P_X$-almost all $x \in X$ such that

\[ \mathbb{E}_P \bigl( L \circ f - L \circ f^*_{\tau,P} \bigr)^2 \le 2^{2-\vartheta} q^{\vartheta} \| \gamma^{-1} \|_{L_p(P_X)}^{\vartheta} \bigl( \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \bigr)^{\vartheta}, \]

where we used the shorthand $L \circ f$ for the function $(x,y) \mapsto L(y, f(x))$.

Again, it is straightforward to show that the variance bound above is both more general 
and stronger than the variance bound established in [23], Theorem 2.6. 

3. An application to support vector machines 

The goal of this section is to establish an oracle inequality for the SVM defined in (2). 
The use of this oracle inequality is then illustrated by some learning rates we derive from 
it. 

Let us begin by recalling some RKHS theory (see, e.g., [24], Chapter 4, for a more
detailed account). To this end, let $k : X \times X \to \mathbb{R}$ be a measurable kernel, that is, a
measurable function that is symmetric and positive definite. Then the associated RKHS
$H$ consists of measurable functions. Let us additionally assume that $k$ is bounded with
$\|k\|_\infty := \sup_{x \in X} \sqrt{k(x,x)} \le 1$, which in turn implies that $H$ consists of bounded
functions and $\|f\|_\infty \le \|f\|_H$ for all $f \in H$.

Suppose now that we have a distribution $P$ on $X \times Y$. To describe the approximation
error of SVMs we use the approximation error function

\[ A(\lambda) := \inf_{f \in H} \lambda \|f\|_H^2 + \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}, \qquad \lambda > 0, \]

where $L$ is the $\tau$-pinball loss. Recall that [24], Lemma 5.15 and Theorem 5.31, showed that
$\lim_{\lambda \to 0} A(\lambda) = 0$, if the RKHS $H$ is dense in $L_1(P_X)$, and the speed of this
convergence describes how well $H$ approximates the Bayes $L$-risk $\mathcal{R}^*_{L,P}$. In particular,
[24], Corollary 5.18, shows that $A(\lambda) \le c\lambda$ for some constant $c > 0$ and all
$\lambda > 0$ if and only if there exists an $f \in H$ such that $f(x) \in F^*_{\tau,P}(x)$ for
$P_X$-almost all $x \in X$.

We further need the integral operator $T_k : L_2(P_X) \to L_2(P_X)$ defined by

\[ T_k f(\cdot) := \int_X k(x, \cdot) f(x) \, dP_X(x), \qquad f \in L_2(P_X). \]

It is well known that $T_k$ is self-adjoint and nuclear; see, for example, [24], Theorem
4.27. Consequently, it has at most countably many eigenvalues (including geometric
multiplicities), which are all non-negative and summable. Let us order these eigenvalues
$\lambda_i(T_k)$. Moreover, if we only have finitely many eigenvalues, we extend this finite
sequence by zeros. As a result, we can always deal with a decreasing, non-negative sequence
$\lambda_1(T_k) \ge \lambda_2(T_k) \ge \cdots$, which satisfies $\sum_{i=1}^\infty \lambda_i(T_k) < \infty$.
The finiteness of this sum can already be used to establish oracle inequalities; see [24],
Theorem 7.22. But in the following we assume that the eigenvalues converge even faster to zero,
since (a) this case is satisfied for many RKHSs and (b) it leads to better oracle inequalities. To be
more precise, we assume that there exist constants $a \ge 1$ and $\varrho \in (0,1)$ such that

\[ \lambda_i(T_k) \le a \, i^{-1/\varrho}, \qquad i \ge 1. \tag{5} \]
Recall that (5) was first used in [6] to establish an oracle inequality for SVMs using the
hinge loss, while [7, 18, 25] consider (5) for SVMs using the least-squares loss. Furthermore,
one can show (see [22]) that (5) is equivalent (modulo a constant only depending
on $\varrho$) to

\[ e_i(\mathrm{id} : H \to L_2(P_X)) \le \sqrt{a} \, i^{-1/(2\varrho)}, \qquad i \ge 1, \tag{6} \]

where $e_i(\mathrm{id} : H \to L_2(P_X))$ denotes the $i$th (dyadic) entropy number [8] of the inclusion
map from $H$ into $L_2(P_X)$. In addition, [22] shows that (6) implies a bound on expectations
of random entropy numbers, which in turn are used in [24], Chapter 7.4, to establish
general oracle inequalities for SVMs. On the other hand, (6) has been extensively studied
in the literature. For example, for $m$-times differentiable kernels on Euclidean balls $X$ of
$\mathbb{R}^d$, it is known that (6) holds for $\varrho := \frac{d}{2m}$. We refer to [10], Chapter 5,
and [24], Theorem 6.26, for a precise statement. Analogously, if $m > d/2$ is some integer, then the
Sobolev space $H := W^m(X)$ is an RKHS that satisfies (6) for $\varrho := \frac{d}{2m}$, and this
estimate is also asymptotically sharp; see [5, 11].

We finally need the clipping operation defined by

\[ \widehat{t} := \max\{ -1, \min\{ 1, t \} \} \]

for all $t \in \mathbb{R}$. We can now state the following oracle inequality for SVMs using the pinball
loss.



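Theorem 3.1. Let $L$ be the $\tau$-pinball loss and $P$ be a distribution on $X \times \mathbb{R}$ with
$\operatorname{supp} P(\cdot \mid x) \subset [-1,1]$ for $P_X$-almost all $x \in X$. Assume that there
exists a function $f^*_{\tau,P} : X \to \mathbb{R}$ with $f^*_{\tau,P}(x) \in F^*_{\tau,P}(x)$ for
$P_X$-almost all $x \in X$ and constants $V \ge 2^{2-\vartheta}$ and $\vartheta \in [0,1]$ such that

\[ \mathbb{E}_P \bigl( L \circ f - L \circ f^*_{\tau,P} \bigr)^2 \le V \bigl( \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \bigr)^{\vartheta} \tag{7} \]

for all $f : X \to [-1,1]$. Moreover, let $H$ be a separable RKHS over $X$ with a bounded
measurable kernel satisfying $\|k\|_\infty \le 1$. In addition, assume that (5) is satisfied for some
$a \ge 1$ and $\varrho \in (0,1)$. Then there exists a constant $K$ depending only on $\varrho$, $V$, and
$\vartheta$ such that, for all $\varsigma \ge 1$, $n \ge 1$ and $\lambda > 0$, we have with probability
$P^n$ not less than $1 - 3e^{-\varsigma}$ that

\[ \mathcal{R}_{L,P}(\widehat{f}_{D,\lambda}) - \mathcal{R}^*_{L,P} \le 9 A(\lambda) + 30 \sqrt{\frac{A(\lambda)}{\lambda}} \, \frac{\varsigma}{n} + K \Bigl( \frac{a^{2\varrho}}{\lambda^{\varrho} n} \Bigr)^{1/(2 - \varrho - \vartheta + \vartheta\varrho)} + 3 \Bigl( \frac{72 V \varsigma}{n} \Bigr)^{1/(2-\vartheta)}. \]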

Let us now discuss the learning rates obtained from this oracle inequality. To this end,
we assume in the following that there exist constants $c > 0$ and $\beta \in (0,1]$ such that

\[ A(\lambda) \le c \lambda^{\beta}, \qquad \lambda > 0. \tag{8} \]

Recall from [24], Corollary 5.18, that, for $\beta = 1$, this assumption holds if and only if
there exists a $\tau$-quantile function $f^*_{\tau,P}$ with $f^*_{\tau,P} \in H$. Moreover, for
$\beta < 1$, there is a tight relationship between (8) and the behavior of the approximation error of
the balls $\lambda^{-1} B_H$; see [24], Theorem 5.25. In addition, one can show (see [24], Chapter 5.6)
that if $f^*_{\tau,P}$ is contained in the real interpolation space $(L_1(P_X), H)_{\theta,\infty}$,
see [4], then (8) is satisfied for $\beta := \theta/(2-\theta)$. For example, if $H := W^m(X)$ is a
Sobolev space over a Euclidean ball $X \subset \mathbb{R}^d$ of order $m > d/2$ and $P_X$ has a
Lebesgue density that is bounded away from 0 and $\infty$, then $f^*_{\tau,P} \in W^s(X)$ for some
$s \in (d/2, m]$ implies (8) for $\beta := s/(2m - s)$.

Now assume that (8) holds. We further assume that $\lambda$ is determined by
$\lambda_n \sim n^{-\gamma/\beta}$, where

\[ \gamma := \min\Bigl\{ \frac{\beta}{\beta(2 - \varrho - \vartheta + \vartheta\varrho) + \varrho}, \; \frac{2\beta}{\beta+1} \Bigr\}. \tag{9} \]

Then Theorem 3.1 shows that $\mathcal{R}_{L,P}(\widehat{f}_{D,\lambda_n})$ converges to
$\mathcal{R}^*_{L,P}$ with rate $n^{-\gamma}$; see [24], Lemma A.1.7, for calculating the value of
$\gamma$. Note that this choice of $\lambda$ yields the best learning rates from Theorem 3.1.
Unfortunately, however, this choice requires knowledge of the usually unknown parameters $\beta$,
$\vartheta$ and $\varrho$. To address this issue, let us consider the following scheme that is close to
approaches taken in practice (see [19] for a similar technique that has a fast implementation based
on regularization paths).
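
The exponent in (9) is easy to evaluate for concrete parameter values. The helper below is a small sketch; the function names are hypothetical, the argument `theta` stands for the variance exponent of (7), `rho` for the eigenvalue exponent of (5), and the proportionality constant in the schedule for $\lambda_n$ is arbitrarily set to 1.

```python
def learning_rate_exponent(beta, theta, rho):
    """gamma from (9): min{ beta / (beta*(2 - rho - theta + theta*rho) + rho),
    2*beta / (beta + 1) }."""
    first = beta / (beta * (2.0 - rho - theta + theta * rho) + rho)
    second = 2.0 * beta / (beta + 1.0)
    return min(first, second)

def regularization_schedule(n, beta, theta, rho):
    """lambda_n proportional to n^(-gamma/beta); the constant 1 is an arbitrary choice."""
    return n ** (-learning_rate_exponent(beta, theta, rho) / beta)

# beta = 1 (quantile function in H), theta = 1 (p = infinity), rho = 1/2:
print(learning_rate_exponent(1.0, 1.0, 0.5))   # equals 1 / (1 + rho) = 2/3
```

For $\beta = 1$ and $\vartheta = p/(p+1)$ the first term reduces to $(p+1)/(2 + p + \varrho p)$, which can be used as a sanity check of the helper.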

Definition 3.2. Let $H$ be an RKHS over $X$ and $\Lambda := (\Lambda_n)$ be a sequence of finite subsets
$\Lambda_n \subset (0,1]$. Given a data set
$D := ((x_1,y_1),\ldots,(x_n,y_n)) \in (X \times \mathbb{R})^n$, we define

\[ D_1 := ((x_1,y_1),\ldots,(x_m,y_m)), \]
\[ D_2 := ((x_{m+1},y_{m+1}),\ldots,(x_n,y_n)), \]

where $m := \lfloor n/2 \rfloor + 1$ and $n \ge 3$. Then we use $D_1$ to compute the SVM decision
functions

\[ f_{D_1,\lambda} := \arg\min_{f \in H} \lambda \|f\|_H^2 + \mathcal{R}_{L,D_1}(f), \qquad \lambda \in \Lambda_n, \]

and $D_2$ to determine $\lambda$ by choosing a $\lambda_{D_2} \in \Lambda_n$ such that

\[ \mathcal{R}_{L,D_2}(\widehat{f}_{D_1,\lambda_{D_2}}) = \min_{\lambda \in \Lambda_n} \mathcal{R}_{L,D_2}(\widehat{f}_{D_1,\lambda}). \]

In the following, we call this learning method, which produces the decision functions
$\widehat{f}_{D_1,\lambda_{D_2}}$, a training validation SVM with respect to $\Lambda$.
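
The selection scheme of Definition 3.2 can be sketched generically. In the snippet below, `training_validation` implements the split and the selection step, while `fit_constant` is a deliberately simple hypothetical stand-in for the SVM solver (it fits a regularized constant on a grid); the data and the candidate set of regularization parameters are likewise assumptions made for illustration.

```python
import numpy as np

def pinball_risk(y, pred, tau):
    """Empirical pinball risk of clipped predictions."""
    pred = np.clip(pred, -1.0, 1.0)           # clipping operation applied before evaluation
    return float(np.mean(np.where(y < pred, (1 - tau) * (pred - y), tau * (y - pred))))

def training_validation(x, y, lambdas, fit, tau):
    """Fit on D_1 for every lambda, then select lambda by the risk on D_2."""
    n = len(y)
    m = n // 2 + 1                            # m := floor(n/2) + 1
    x1, y1, x2, y2 = x[:m], y[:m], x[m:], y[m:]
    models = {lam: fit(x1, y1, lam) for lam in lambdas}
    risks = {lam: pinball_risk(y2, models[lam](x2), tau) for lam in lambdas}
    best = min(risks, key=risks.get)
    return best, models[best], risks

def fit_constant(x, y, lam, tau=0.5):
    """Hypothetical stand-in learner: constant c minimizing lam*c^2 + empirical pinball risk."""
    grid = np.linspace(-1.0, 1.0, 201)
    values = [lam * c * c + float(np.mean(np.where(y < c, (1 - tau) * (c - y), tau * (y - c))))
              for c in grid]
    c_best = float(grid[int(np.argmin(values))])
    return lambda x_new: np.full(len(x_new), c_best)

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=101)
y = rng.uniform(-1.0, 1.0, size=101)
tau = 0.5
lambdas = [1.0, 0.1, 0.01]
best_lam, best_model, risks = training_validation(
    x, y, lambdas, lambda a, b, lam: fit_constant(a, b, lam, tau), tau)
```

Any learner that returns a prediction function can be passed as `fit`, so the same skeleton applies to an actual kernel-based solver of (2).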

Training validation SVMs have been extensively studied in [24], Chapter 7.4. In particular,
[24], Theorem 7.24, gives the following result that shows that the learning rate
$n^{-\gamma}$ can be achieved without knowing of the existence of the parameters $\beta$, $\vartheta$
and $\varrho$ or their particular values.

Theorem 3.3. Let $(\Lambda_n)$ be a sequence of $n^{-2}$-nets $\Lambda_n$ of $(0,1]$ such that the
cardinality $|\Lambda_n|$ of $\Lambda_n$ grows polynomially in $n$. Furthermore, consider the situation
of Theorem 3.1 and assume that (8) is satisfied for some $\beta \in (0,1]$. Then the training
validation SVM with respect to $\Lambda := (\Lambda_n)$ learns with rate $n^{-\gamma}$, where $\gamma$
is defined by (9).
Let us now consider how these learning rates in terms of risks translate into rates for

\[ \| \widehat{f}_{D,\lambda_n} - f^*_{\tau,P} \|_{L_r(P_X)} \to 0. \tag{10} \]

To this end, we assume that $P$ has a $\tau$-quantile of $p$-average type $q$, where we additionally
assume for the sake of simplicity that $r := \frac{pq}{p+1} \le 2$. Note that the latter is satisfied
for all $p$ if $q \le 2$, that is, if all conditional distributions are concentrated around the
quantile at least as much as the uniform distribution; see the discussion following Definition 2.1.
We further assume that the conditional quantiles $F^*_{\tau,P}(x)$ are singletons for $P_X$-almost all
$x \in X$. Then Theorem 2.8 provides a variance bound of the form (7) for $\vartheta := p/(p+1)$,
and hence $\gamma$ defined in (9) becomes

\[ \gamma = \min\Bigl\{ \frac{\beta(p+1)}{\beta(2 + p - \varrho) + \varrho(p+1)}, \; \frac{2\beta}{\beta+1} \Bigr\}. \]

By Theorem 2.7 we consequently see that (10) converges with rate $n^{-\gamma/q}$, where
$r := pq/(p+1)$. To illustrate this learning rate, let us assume that we have picked an RKHS $H$
with $f^*_{\tau,P} \in H$. Then we have $\beta = 1$, and hence it is easy to check that the latter
learning rate reduces to

\[ n^{-(p+1)/(q(2 + p + \varrho p))}. \]

For the sake of simplicity, let us further assume that the conditional distributions do not
change too much in the sense that $p = \infty$. Then we have $r = q$, and hence

\[ \int_X \bigl| \widehat{f}_{D,\lambda_n} - f^*_{\tau,P} \bigr|^q \, dP_X \tag{11} \]

converges to zero with rate $n^{-1/(1+\varrho)}$. The latter shows that the value of $q$ does not
change the learning rate for (11), but only the exponent in (11). Now note that by our
assumption on $P$ and the definition of the clipping operation we have

\[ \| \widehat{f}_{D,\lambda_n} - f^*_{\tau,P} \|_\infty \le 2, \]

and consequently small values of $q$ emphasize the discrepancy of $\widehat{f}_{D,\lambda_n}$ to
$f^*_{\tau,P}$ more than large values of $q$ do. In this sense, a stronger average concentration
around the quantile is helpful for the learning process.

Let us now have a closer look at the special case $q = 2$, which is probably the most
interesting case for applications. Then we have the learning rate $n^{-1/(2(1+\varrho))}$ for

\[ \| \widehat{f}_{D,\lambda_n} - f^*_{\tau,P} \|_{L_2(P_X)}. \]

Now recall that the conditional median equals the conditional mean for symmetric conditional
distributions $P(\cdot \mid x)$. Moreover, if $H$ is a Sobolev space $W^m(X)$, where $m > d/2$
denotes the smoothness index and $X$ is a Euclidean ball in $\mathbb{R}^d$, then $H$ consists of
continuous functions, and [11] shows that $H$ satisfies (5) for $\varrho := d/(2m)$. Consequently, we
see that in this case the latter convergence rate is optimal in a min-max sense [27, 29]
if $P_X$ is the uniform distribution. Finally, recall that in the case $\beta = 1$, $q = 2$ and
$p = \infty$ discussed so far, the results derived in [23] only yield a learning rate of
$n^{-1/(3(1+\varrho))}$ for

\[ \| \widehat{f}_{D,\lambda_n} - f^*_{\tau,P} \|_{L_1(P_X)}. \]

In other words, the earlier rates from [23] are not only worse by a factor of 3/2 in the
exponent but also are stated in terms of the weaker $L_1(P_X)$-norm. In addition, [23] only
considered the case $q = 2$, and hence we see that our new results are also more general.
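
As a worked illustration of the Sobolev case discussed above, the exponent can be written out explicitly. The computation below only substitutes $\varrho = d/(2m)$ into the rate $n^{-1/(2(1+\varrho))}$ and into the rate attributed to [23]; the concrete values $m = 2$, $d = 1$ are an arbitrary choice.

```latex
\[
  \frac{1}{2(1+\varrho)}
  = \frac{1}{2 + d/m}
  = \frac{m}{2m + d},
  \qquad
  \frac{1}{3(1+\varrho)}
  = \frac{2m}{3(2m + d)} .
\]
% Example: m = 2, d = 1 gives the exponents 2/5 and 4/15,
% whose ratio is again 3/2.
```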



4. Proofs 

Since the proofs of Theorems 2.7 and 2.8 use some notation developed in [21] and [24],
Chapter 3, let us begin by recalling these. To this end, let $L$ be the $\tau$-pinball loss for
some fixed $\tau \in (0,1)$ and $Q$ be a distribution on $\mathbb{R}$ with
$\operatorname{supp} Q \subset [-1,1]$. Then [21, 24] defined the inner $L$-risks by

\[ \mathcal{C}_{L,Q}(t) := \int_{\mathbb{R}} L(y,t) \, dQ(y), \qquad t \in \mathbb{R}, \]

and the minimal inner $L$-risk was denoted by
$\mathcal{C}^*_{L,Q} := \inf_{t \in \mathbb{R}} \mathcal{C}_{L,Q}(t)$. Moreover, we write
$\mathcal{M}_{L,Q}(0^+) = \{ t \in \mathbb{R} : \mathcal{C}_{L,Q}(t) = \mathcal{C}^*_{L,Q} \}$ for the set
of exact minimizers.

Our first goal is to compute the excess inner risks and the set of exact minimizers for
the pinball loss. To this end recall that (see [3], Theorem 23.8), given a distribution $Q$
on $\mathbb{R}$ and a measurable function $g : \mathbb{R} \to [0,\infty)$, we have

\[ \int_{\mathbb{R}} g \, dQ = \int_0^\infty Q(\{ g \ge s \}) \, ds. \tag{12} \]

With these preparations we can now show the following generalization of [24], Proposition
3.9.

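Proposition 4.1. Let $L$ be the $\tau$-pinball loss and $Q$ be a distribution on $\mathbb{R}$ with
$\mathcal{C}^*_{L,Q} < \infty$. Then there exist $q_+, q_- \in [0,1]$ with
$q_+ + q_- = Q([t^*_{\min}, t^*_{\max}])$, and, for all $t \ge 0$, we have

\[ \mathcal{C}_{L,Q}(t^*_{\max} + t) - \mathcal{C}^*_{L,Q} = t q_+ + \int_0^t Q((t^*_{\max}, t^*_{\max} + s)) \, ds, \tag{13} \]
\[ \mathcal{C}_{L,Q}(t^*_{\min} - t) - \mathcal{C}^*_{L,Q} = t q_- + \int_0^t Q((t^*_{\min} - s, t^*_{\min})) \, ds. \tag{14} \]

Moreover, if $t^*_{\min} \ne t^*_{\max}$, then we have $q_- = Q(\{t^*_{\min}\})$ and
$q_+ = Q(\{t^*_{\max}\})$. Finally, $\mathcal{M}_{L,Q}(0^+)$ equals the $\tau$-quantile, that is,
$\mathcal{M}_{L,Q}(0^+) = [t^*_{\min}, t^*_{\max}]$.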
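
Formula (13) can be verified by hand for a discrete distribution, where both sides are finite sums. The sketch below assumes a distribution $Q$ with three equally weighted atoms and $\tau = 1/2$, for which the median is the single point 0; the helper names are hypothetical, and $q_+$ is computed from the relation $Q((-\infty, t^*_{\max}]) = \tau + q_+$ used in the proof.

```python
import numpy as np

def inner_risk(points, weights, t, tau):
    """Inner pinball risk of t for a discrete Q = sum_i weights[i] * delta_{points[i]}."""
    loss = np.where(points < t, (1.0 - tau) * (t - points), tau * (points - t))
    return float(np.sum(weights * loss))

def rhs_13(points, weights, t_max, t, tau):
    """Right-hand side of (13): t * q_+ + int_0^t Q((t_max, t_max + s)) ds."""
    q_plus = float(np.sum(weights[points <= t_max])) - tau
    inside = (points > t_max) & (points < t_max + t)
    # An atom at y in (t_max, t_max + t) lies in (t_max, t_max + s) for s in (y - t_max, t].
    integral = float(np.sum(weights[inside] * (t - (points[inside] - t_max))))
    return t * q_plus + integral

points = np.array([-0.5, 0.0, 0.5])
weights = np.array([1.0, 1.0, 1.0]) / 3.0
tau, t_max = 0.5, 0.0                      # the median of Q is the single point 0
c_star = inner_risk(points, weights, t_max, tau)

for t in (0.3, 0.8):
    lhs = inner_risk(points, weights, t_max + t, tau) - c_star
    print(t, lhs, rhs_13(points, weights, t_max, t, tau))
```

For $t = 0.3$ only the term $t q_+$ contributes, while for $t = 0.8$ the atom at $0.5$ adds to the integral.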
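Proof. Obviously, we have $Q((-\infty,t^*_{\max}]) + Q([t^*_{\max},\infty)) = 1 + Q(\{t^*_{\max}\})$, and
hence we obtain $\tau \le Q((-\infty,t^*_{\max}]) \le \tau + Q(\{t^*_{\max}\})$. In other words, there
exists a $q_+ \in [0,1]$ satisfying $0 \le q_+ \le Q(\{t^*_{\max}\})$ and

\[ Q((-\infty,t^*_{\max}]) = \tau + q_+. \tag{15} \]

Let us consider the distribution $\tilde{Q}$ defined by $\tilde{Q}(A) := Q(t^*_{\max} + A)$ for all
measurable $A \subset \mathbb{R}$. Then it is not hard to see that $t^*_{\max}(\tilde{Q}) = 0$. Moreover,
we obviously have $\mathcal{C}_{L,Q}(t^*_{\max} + t) = \mathcal{C}_{L,\tilde{Q}}(t)$ for all
$t \in \mathbb{R}$. Let us now compute the inner risks of $L$ with respect to $\tilde{Q}$. To this end,
we fix a $t \ge 0$. Then we have

\[ \int_{y<t} (y-t) \, d\tilde{Q}(y) = \int_{y<0} y \, d\tilde{Q}(y) - t \tilde{Q}((-\infty,t)) + \int_{0 \le y < t} y \, d\tilde{Q}(y) \]

and

\[ \int_{y \ge t} (y-t) \, d\tilde{Q}(y) = \int_{y \ge 0} y \, d\tilde{Q}(y) - t \tilde{Q}([t,\infty)) - \int_{0 \le y < t} y \, d\tilde{Q}(y), \]

and hence we obtain

\[ \mathcal{C}_{L,\tilde{Q}}(t) = (\tau - 1) \int_{y<t} (y-t) \, d\tilde{Q}(y) + \tau \int_{y \ge t} (y-t) \, d\tilde{Q}(y) \]
\[ \phantom{\mathcal{C}_{L,\tilde{Q}}(t)} = \mathcal{C}_{L,\tilde{Q}}(0) - \tau t + t \tilde{Q}((-\infty,0)) + t \tilde{Q}([0,t)) - \int_{0 \le y < t} y \, d\tilde{Q}(y). \]

Moreover, using (12) we find

\[ t \tilde{Q}([0,t)) - \int_{0 \le y < t} y \, d\tilde{Q}(y) = \int_0^t \tilde{Q}([0,t)) \, ds - \int_0^t \tilde{Q}([s,t)) \, ds = t \tilde{Q}(\{0\}) + \int_0^t \tilde{Q}((0,s)) \, ds, \]

and since (15) implies
$\tilde{Q}((-\infty,0)) + \tilde{Q}(\{0\}) = \tilde{Q}((-\infty,0]) = \tau + q_+$, we thus obtain

\[ \mathcal{C}_{L,Q}(t^*_{\max} + t) = \mathcal{C}_{L,Q}(t^*_{\max}) + t q_+ + \int_0^t Q((t^*_{\max}, t^*_{\max} + s)) \, ds. \tag{16} \]

By considering the pinball loss with parameter $1 - \tau$ and the reflected distribution $\bar{Q}$
defined by $\bar{Q}(A) := Q(t^*_{\min} - A)$, $A \subset \mathbb{R}$ measurable, we further see that (16)
implies

\[ \mathcal{C}_{L,Q}(t^*_{\min} - t) = \mathcal{C}_{L,Q}(t^*_{\min}) + t q_- + \int_0^t Q((t^*_{\min} - s, t^*_{\min})) \, ds, \qquad t \ge 0, \tag{17} \]

where $q_-$ satisfies $0 \le q_- \le Q(\{t^*_{\min}\})$ and $Q([t^*_{\min},\infty)) = 1 - \tau + q_-$. By
(15) we then find $q_+ + q_- = Q([t^*_{\min}, t^*_{\max}])$. Moreover, if $t^*_{\min} \ne t^*_{\max}$,
the fact $Q((t^*_{\min}, t^*_{\max})) = 0$ yields

\[ q_+ + q_- = Q([t^*_{\min}, t^*_{\max}]) = Q(\{t^*_{\min}\}) + Q(\{t^*_{\max}\}). \]

Using the earlier established $q_+ \le Q(\{t^*_{\max}\})$ and $q_- \le Q(\{t^*_{\min}\})$, we then find
both $q_- = Q(\{t^*_{\min}\})$ and $q_+ = Q(\{t^*_{\max}\})$.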

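To prove (13) and (14), we first consider the case $t^*_{\min} = t^*_{\max}$. Then (16) and (17) yield
$\mathcal{C}_{L,Q}(t^*_{\min}) = \mathcal{C}_{L,Q}(t^*_{\max}) \le \mathcal{C}_{L,Q}(t)$, $t \in \mathbb{R}$.
This implies $\mathcal{C}_{L,Q}(t^*_{\min}) = \mathcal{C}_{L,Q}(t^*_{\max}) = \mathcal{C}^*_{L,Q}$,
and hence we conclude that (16) and (17) are equivalent to (13) and (14), respectively.
Moreover, in the case $t^*_{\min} \ne t^*_{\max}$, we have $Q((t^*_{\min}(Q), t^*_{\max}(Q))) = 0$, which
in turn implies $Q((-\infty,t^*_{\min}]) = \tau$ and $Q([t^*_{\max},\infty)) = 1 - \tau$. For
$t \in (t^*_{\min}, t^*_{\max}]$, we consequently find

\[ \mathcal{C}_{L,Q}(t) = (\tau - 1) \int_{y<t} (y-t) \, dQ(y) + \tau \int_{y \ge t} (y-t) \, dQ(y) \]
\[ \phantom{\mathcal{C}_{L,Q}(t)} = (\tau - 1) \int_{y<t} y \, dQ(y) + \tau \int_{y \ge t} y \, dQ(y), \tag{18} \]

where we used $Q((-\infty,t)) = Q((-\infty,t^*_{\min}]) = \tau$ and
$Q([t,\infty)) = Q([t^*_{\max},\infty)) = 1 - \tau$. Since the right-hand side of (18) is independent of
$t$, we thus conclude $\mathcal{C}_{L,Q}(t) = \mathcal{C}_{L,Q}(t^*_{\max})$ for all
$t \in (t^*_{\min}, t^*_{\max}]$. Analogously, we find
$\mathcal{C}_{L,Q}(t) = \mathcal{C}_{L,Q}(t^*_{\min})$ for all $t \in [t^*_{\min}, t^*_{\max})$, and hence
we can, again, conclude
$\mathcal{C}_{L,Q}(t^*_{\min}) = \mathcal{C}_{L,Q}(t^*_{\max}) \le \mathcal{C}_{L,Q}(t)$ for all
$t \in \mathbb{R}$. As in the case $t^*_{\min} = t^*_{\max}$, the latter implies that (16) and (17) are
equivalent to (13) and (14), respectively.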

For the proof of $\mathcal{M}_{L,Q}(0^+) = [t^*_{\min}, t^*_{\max}]$, we first note that the previous
discussion has already shown $\mathcal{M}_{L,Q}(0^+) \supset [t^*_{\min}, t^*_{\max}]$. Let us assume
that $\mathcal{M}_{L,Q}(0^+) \not\subset [t^*_{\min}, t^*_{\max}]$. By a symmetry argument, we then may
assume without loss of generality that there exists a $t \in \mathcal{M}_{L,Q}(0^+)$ with
$t > t^*_{\max}$. From (13) we then conclude that $q_+ = 0$ and $Q((t^*_{\max}, t)) = 0$. Now,
$q_+ = 0$ together with (15) shows $Q((-\infty,t^*_{\max}]) = \tau$, which in turn implies
$Q((-\infty,t]) \ge \tau$. Moreover, $Q((t^*_{\max}, t)) = 0$ yields

\[ Q([t,\infty)) = Q([t^*_{\max},\infty)) - Q(\{t^*_{\max}\}) = 1 - Q((-\infty,t^*_{\max}]) = 1 - \tau. \]

In other words, $t$ is a $\tau$-quantile, which contradicts $t > t^*_{\max}$. □

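For the proof of Theorem 2.7 we further need the self-calibration loss of $L$ that is
defined by

\[ \breve{L}(Q,t) := \operatorname{dist}(t, \mathcal{M}_{L,Q}(0^+)), \qquad t \in \mathbb{R}, \tag{19} \]

where $Q$ is a distribution with $\operatorname{supp} Q \subset [-1,1]$. Let us define the
self-calibration function by

\[ \delta_{\max,\breve{L},L}(\varepsilon, Q) := \inf_{t \in \mathbb{R} : \breve{L}(Q,t) \ge \varepsilon} \mathcal{C}_{L,Q}(t) - \mathcal{C}^*_{L,Q}, \qquad \varepsilon \ge 0. \]

Note that if, for $t \in \mathbb{R}$, we write
$\varepsilon := \operatorname{dist}(t, \mathcal{M}_{L,Q}(0^+))$, then we have
$\breve{L}(Q,t) \ge \varepsilon$, and hence the definition of the self-calibration function yields

\[ \delta_{\max,\breve{L},L}\bigl( \operatorname{dist}(t, \mathcal{M}_{L,Q}(0^+)), Q \bigr) \le \mathcal{C}_{L,Q}(t) - \mathcal{C}^*_{L,Q}, \qquad t \in \mathbb{R}. \tag{20} \]

In other words, the self-calibration function measures how well an $\varepsilon$-approximate $L$-risk
minimizer $t$ approximates the set of exact $L$-risk minimizers.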

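Our next goal is to estimate the self-calibration function for the pinball loss. To this
end we need the following simple technical lemma.

Lemma 4.2. For $\alpha \in [0,2]$ and $q \in [1,\infty)$ consider the function
$\delta : [0,2] \to [0,\infty)$ defined by

\[ \delta(\varepsilon) := \begin{cases} \varepsilon^q, & \text{if } \varepsilon \in [0,\alpha], \\ q \alpha^{q-1} \varepsilon - \alpha^q (q-1), & \text{if } \varepsilon \in [\alpha, 2]. \end{cases} \]

Then, for all $\varepsilon \in [0,2]$, we have

\[ \delta(\varepsilon) \ge \Bigl( \frac{\alpha}{2} \Bigr)^{q-1} \varepsilon^q. \]

Proof. Since $\alpha \le 2$ and $q \ge 1$ we easily see by the definition of $\delta$ that the
assertion is true for $\varepsilon \in [0,\alpha]$. Now consider the function
$h : [\alpha,2] \to \mathbb{R}$ defined by

\[ h(\varepsilon) := q \alpha^{q-1} \varepsilon - \alpha^q (q-1) - \Bigl( \frac{\alpha}{2} \Bigr)^{q-1} \varepsilon^q, \qquad \varepsilon \in [\alpha,2]. \]

It suffices to show that $h(\varepsilon) \ge 0$ for all $\varepsilon \in [\alpha,2]$. To show the latter
we first check that

\[ h'(\varepsilon) = q \alpha^{q-1} - q \Bigl( \frac{\alpha}{2} \Bigr)^{q-1} \varepsilon^{q-1}, \qquad \varepsilon \in [\alpha,2], \]

and hence we have $h'(\varepsilon) \ge 0$ for all $\varepsilon \in [\alpha,2]$. Now we obtain the
assertion from this, $\alpha \in [0,2]$ and

\[ h(\alpha) = \alpha^q - \Bigl( \frac{\alpha}{2} \Bigr)^{q-1} \alpha^q = \alpha^q \Bigl( 1 - \Bigl( \frac{\alpha}{2} \Bigr)^{q-1} \Bigr) \ge 0. \qquad \Box \]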
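
The inequality of Lemma 4.2 is elementary, and a quick grid evaluation can serve as a sanity check of the reconstructed formulas. The snippet below is only such a check, not a proof; the chosen values of $\alpha$ and $q$ and the function names are arbitrary.

```python
import numpy as np

def delta(eps, alpha, q):
    """Function from Lemma 4.2: eps^q on [0, alpha], its tangent at alpha on [alpha, 2]."""
    eps = np.asarray(eps, dtype=float)
    return np.where(eps <= alpha, eps ** q, q * alpha ** (q - 1) * eps - alpha ** q * (q - 1))

def lower_bound(eps, alpha, q):
    """Claimed lower bound (alpha / 2)^(q - 1) * eps^q."""
    return (alpha / 2.0) ** (q - 1) * np.asarray(eps, dtype=float) ** q

eps = np.linspace(0.0, 2.0, 2001)
gaps = [float(np.min(delta(eps, a, q) - lower_bound(eps, a, q)))
        for a in (0.25, 1.0, 2.0) for q in (1.0, 2.0, 3.5)]
print(min(gaps))   # should be non-negative up to rounding
```

The gap vanishes for $q = 1$ and for $\alpha = 2$, which shows that the bound cannot be improved in general.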

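Lemma 4.3. Let $L$ be the $\tau$-pinball loss and $Q$ be a distribution on $\mathbb{R}$ with
$\operatorname{supp} Q \subset [-1,1]$ that has a $\tau$-quantile of type $q \in [1,\infty)$. Moreover,
let $\alpha_Q \in (0,2]$ and $b_Q > 0$ denote the corresponding constants. Then, for all
$\varepsilon \in [0,2]$, we have

\[ \delta_{\max,\breve{L},L}(\varepsilon, Q) \ge q^{-1} b_Q \Bigl( \frac{\alpha_Q}{2} \Bigr)^{q-1} \varepsilon^q = q^{-1} 2^{1-q} \gamma_Q \varepsilon^q. \]

Proof. Since $L$ is convex, the map $t \mapsto \mathcal{C}_{L,Q}(t) - \mathcal{C}^*_{L,Q}$ is convex, and
thus it is decreasing on $(-\infty,t^*_{\min}]$ and increasing on $[t^*_{\max},\infty)$. Using
$\mathcal{M}_{L,Q}(0^+) = [t^*_{\min}, t^*_{\max}]$, we thus find

\[ \mathcal{M}_{\breve{L},Q}(\varepsilon) := \{ t \in \mathbb{R} : \breve{L}(Q,t) < \varepsilon \} = (t^*_{\min} - \varepsilon, t^*_{\max} + \varepsilon) \]

for all $\varepsilon > 0$. Since this gives
$\delta_{\max,\breve{L},L}(\varepsilon, Q) = \inf_{t \notin \mathcal{M}_{\breve{L},Q}(\varepsilon)} \mathcal{C}_{L,Q}(t) - \mathcal{C}^*_{L,Q}$,
we obtain

\[ \delta_{\max,\breve{L},L}(\varepsilon, Q) = \min\{ \mathcal{C}_{L,Q}(t^*_{\min} - \varepsilon), \mathcal{C}_{L,Q}(t^*_{\max} + \varepsilon) \} - \mathcal{C}^*_{L,Q}. \tag{21} \]

Let us first consider the case $q \in (1,\infty)$. For $\varepsilon \in [0,\alpha_Q]$, (13) and (4)
then yield

\[ \mathcal{C}_{L,Q}(t^*_{\max} + \varepsilon) - \mathcal{C}^*_{L,Q} = \varepsilon q_+ + \int_0^\varepsilon Q((t^*_{\max}, t^*_{\max} + s)) \, ds \ge b_Q \int_0^\varepsilon s^{q-1} \, ds = q^{-1} b_Q \varepsilon^q, \]

and, for $\varepsilon \in [\alpha_Q, 2]$, (13) and (4) yield

\[ \mathcal{C}_{L,Q}(t^*_{\max} + \varepsilon) - \mathcal{C}^*_{L,Q} \ge b_Q \int_0^{\alpha_Q} s^{q-1} \, ds + b_Q \int_{\alpha_Q}^{\varepsilon} \alpha_Q^{q-1} \, ds = q^{-1} b_Q \bigl( q \alpha_Q^{q-1} \varepsilon - \alpha_Q^q (q-1) \bigr). \]

For $\varepsilon \in [0,2]$, we have thus shown
$\mathcal{C}_{L,Q}(t^*_{\max} + \varepsilon) - \mathcal{C}^*_{L,Q} \ge q^{-1} b_Q \delta(\varepsilon)$,
where $\delta$ is the function defined in Lemma 4.2 for $\alpha := \alpha_Q$.

Furthermore, in the case $q = 1$ and $t^*_{\min} \ne t^*_{\max}$, Proposition 4.1 shows
$q_+ = Q(\{t^*_{\max}\})$, and hence (13) yields
$\mathcal{C}_{L,Q}(t^*_{\max} + \varepsilon) - \mathcal{C}^*_{L,Q} \ge \varepsilon q_+ \ge b_Q \varepsilon$
for all $\varepsilon \in [0,2] = [0,\alpha_Q]$ by the definition of $b_Q$ and $\alpha_Q$. In the case
$q = 1$ and $t^*_{\min} = t^*_{\max}$, (15) yields $q_+ = Q((-\infty,t^*_{\max}]) - \tau \ge b_Q$ by the
definition of $b_Q$, and hence (13) again gives
$\mathcal{C}_{L,Q}(t^*_{\max} + \varepsilon) - \mathcal{C}^*_{L,Q} \ge b_Q \varepsilon$ for all
$\varepsilon \in [0,2]$. Finally, using (14) instead of (13), we can analogously show
$\mathcal{C}_{L,Q}(t^*_{\min} - \varepsilon) - \mathcal{C}^*_{L,Q} \ge q^{-1} b_Q \delta(\varepsilon)$ for
all $\varepsilon \in [0,2]$ and $q \ge 1$. By (21) we thus conclude that

\[ \delta_{\max,\breve{L},L}(\varepsilon, Q) \ge q^{-1} b_Q \delta(\varepsilon) \]

for all $\varepsilon \in [0,2]$. Now the assertion follows from Lemma 4.2. □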

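Proof of Theorem 2.7. For fixed $x \in X$ we write
$\varepsilon := \operatorname{dist}(f(x), \mathcal{M}_{L,P(\cdot \mid x)}(0^+))$. By
Lemma 4.3 and (20) we obtain, for $P_X$-almost all $x \in X$,

\[ \bigl| \operatorname{dist}(f(x), \mathcal{M}_{L,P(\cdot \mid x)}(0^+)) \bigr|^q \le q 2^{q-1} \gamma^{-1}(x) \, \delta_{\max,\breve{L},L}(\varepsilon, P(\cdot \mid x)) \]
\[ \phantom{\bigl| \operatorname{dist}(f(x), \mathcal{M}_{L,P(\cdot \mid x)}(0^+)) \bigr|^q} \le q 2^{q-1} \gamma^{-1}(x) \bigl( \mathcal{C}_{L,P(\cdot \mid x)}(f(x)) - \mathcal{C}^*_{L,P(\cdot \mid x)} \bigr). \]

By taking the $\frac{p}{p+1}$-th power on both sides, integrating and finally applying Hölder's
inequality, we then obtain the assertion. □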

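Proof of Theorem 2.8. Let $f : X \to [-1,1]$ be a function. Since $F^*_{\tau,P}(x)$ is closed,
there then exists a $P_X$-almost surely uniquely determined function
$f^*_{\tau,P} : X \to [-1,1]$ that satisfies both

\[ f^*_{\tau,P}(x) \in F^*_{\tau,P}(x), \]
\[ |f(x) - f^*_{\tau,P}(x)| = \operatorname{dist}(f(x), F^*_{\tau,P}(x)) \]

for $P_X$-almost all $x \in X$. Let us write $r := \frac{pq}{p+1}$. We first consider the case
$r \le 2$, that is, $\frac{q}{2} \le \frac{p+1}{p}$. Using the Lipschitz continuity of the pinball loss
$L$ and Theorem 2.7 we then obtain

\[ \mathbb{E}_P \bigl( L \circ f - L \circ f^*_{\tau,P} \bigr)^2 \le \mathbb{E}_{P_X} |f - f^*_{\tau,P}|^2 \]
\[ \le \| f - f^*_{\tau,P} \|_\infty^{2-r} \, \mathbb{E}_{P_X} |f - f^*_{\tau,P}|^r \]
\[ \le 2^{2 - r/q} q^{r/q} \| \gamma^{-1} \|_{L_p(P_X)}^{r/q} \bigl( \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \bigr)^{r/q}. \]

Since $\frac{r}{q} = \frac{p}{p+1} = \vartheta$, we thus obtain the assertion in this case. Let us now
consider the case $r > 2$. The Lipschitz continuity of $L$ and Theorem 2.7 yield

\[ \mathbb{E}_P \bigl( L \circ f - L \circ f^*_{\tau,P} \bigr)^2 \le \bigl( \mathbb{E}_P | L \circ f - L \circ f^*_{\tau,P} |^r \bigr)^{2/r} \]
\[ \le \bigl( \mathbb{E}_{P_X} |f - f^*_{\tau,P}|^r \bigr)^{2/r} \]
\[ \le \Bigl( 2^{1-1/q} q^{1/q} \| \gamma^{-1} \|_{L_p(P_X)}^{1/q} \bigl( \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \bigr)^{1/q} \Bigr)^2 \]
\[ = 2^{2-2/q} q^{2/q} \| \gamma^{-1} \|_{L_p(P_X)}^{2/q} \bigl( \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \bigr)^{2/q}. \]

Since for $r > 2$ we have $\vartheta = 2/q$, we again obtain the assertion. □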

Proof of Theorem 3.1. As shown in [22], Lemma 2.2, (5) is equivalent to the entropy
assumption (6), which in turn implies (see [22], Theorem 2.1, and [24], Corollary 7.31)

\[ \mathbb{E}_{D_X \sim P_X^n} \, e_i(\mathrm{id} : H \to L_2(D_X)) \le c \sqrt{a} \, i^{-1/(2\varrho)}, \qquad i \ge 1, \tag{22} \]

where $D_X$ denotes the empirical measure with respect to $D_X = (x_1,\ldots,x_n)$ and $c \ge 1$ is
a constant only depending on $\varrho$. Now the assertion follows from [24], Theorem 7.23, by
considering the function $f_0 \in H$ that achieves
$\lambda \|f_0\|_H^2 + \mathcal{R}_{L,P}(f_0) - \mathcal{R}^*_{L,P} = A(\lambda)$. □